Bayes MCMC Models

Steve Elston

10/27/2022

Review

Bayesian analysis is a contrast to frequentist methods

Review

Two functions must be defined to compute the posterior distribution

Introduction

How can we extend Bayes models to more complex problems?

Grid Sampling Cannot Scale!

Real-world Bayes models have large numbers of parameters, into the millions

Scaling Bayesian models

How can we scale Bayesian models to 1000s of parameters?

Introduction to Markov Chain Monte Carlo

Large-scale Bayesian models need highly efficient sampling methods

What is a Markov process?

MCMC sampling uses a Markov process to generate the sampling chain

What is a Markov process?

Since the Markov transition process depends only on the current state, a Markov process is memoryless

\[p(X_{t + 1}| X_t, X_{t-1}, X_{t-2}, \ldots, X_0) = p(X_{t + 1}| X_t)\]

What is a Markov process?

For system with \(N\) possible states we can write the transition probability matrix, \(\Pi\):

\[\begin{align} \Pi &= \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1,N}\\ \pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\ \vdots & \vdots & \ddots & \vdots \\ \pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N} \end{bmatrix}\\ &\text{where}\\ \pi_{i,j} &= \text{probability of transition from state } x_j \text{ to state } x_i\\ &\text{and}\\ \pi_{i,i} &= \text{probability of staying in state } x_i\\ &\text{further}\\ \pi_{i,j} &\ne \pi_{j,i}\ \text{in general} \end{align}\]

Example of a Markov Process

To make the foregoing more concrete let’s construct a simple example. We will start with a system of 3 states, \(\{ x_1, x_2, x_3 \}\). The transition matrix is:

\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \pi_{1,3}\\ \pi_{2,1} & \pi_{2,2} & \pi_{2,3}\\ \pi_{3,1} & \pi_{3,2} & \pi_{3,3} \end{bmatrix} = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \]

Example of a Markov Process

Let’s apply the transition probability matrix to a set of possible states

\[\vec{X}_{t+1} = \Pi\ \vec{X}_t = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix} \]
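One transition step can be sketched in NumPy as a matrix-vector product, using the example transition matrix above:

```python
import numpy as np

# Column-stochastic transition matrix from the example above
Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])

# Start with certainty of being in state x_1
x_t = np.array([1.0, 0.0, 0.0])

# One Markov transition step
x_next = Pi @ x_t
print(x_next)  # [0.5 0.2 0.3]
```

Note that each column of the matrix sums to 1.0, so the state probabilities remain a valid distribution after each step.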

From Markov process to Markov chain

The foregoing is a single step of a Markov process

From Markov process to Markov chain

A Markov chain comprises a number of state transitions

From Markov process to Markov chain

Starting from an initial state, \(\vec{X}_0\), repeatedly apply the transition matrix:

\[\Pi\ \Pi\ \Pi\ \ldots \Pi\ \vec{X}_t = \Pi^n\ \vec{X}_t \xrightarrow[\text{$n \rightarrow \infty$}]{} p(\vec{X})\]
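A minimal sketch of this convergence, using the example transition matrix: repeated multiplication by \(\Pi\) drives any starting state to the stationary distribution \(p(\vec{X})\), which satisfies \(\Pi\, p = p\):

```python
import numpy as np

Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])

x = np.array([1.0, 0.0, 0.0])
for _ in range(100):        # repeated transitions: Pi^n x
    x = Pi @ x

# The chain converges to the stationary distribution p(X),
# the eigenvector of Pi with eigenvalue 1
print(x)
print(np.allclose(Pi @ x, x))  # True
```

The limit is independent of the starting state: restarting from \([0, 1, 0]\) or \([0, 0, 1]\) converges to the same distribution.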

MCMC and the Metropolis-Hastings Algorithm

The first MCMC sampling algorithm developed was the Metropolis-Hastings (M-H) algorithm, often referred to as the Metropolis algorithm or the M-H algorithm.

  1. Pick a starting point in the parameter space

  2. Choose a nearby point in parameter space randomly from a sampling distribution

  3. Evaluate the posterior at this point

    • Product of the likelihood \(P(data|parameters)\) and prior, \(P(parameters)\)
  4. Use the following decision rule to accept or reject the new sample:

  5. If the sample is accepted, move the chain to the new sample point and compute the posterior density there; otherwise, remain at the current point

  6. Repeat steps, 2, 3, 4 and 5, many times, until convergence

MCMC and the Metropolis-Hastings Algorithm

M-H algorithm uses the following decision rule to accept or reject the new sample:

\[Acceptance\ probability\ = min \left( 1,\ \frac{p(data | new\ parameters)}{p(data | previous\ parameters)} \right)\]

MCMC and the Metropolis-Hastings Algorithm

Eventually, the M-H algorithm converges to the posterior distribution

MCMC and the Metropolis-Hastings Algorithm

Poor convergence arises from low sample efficiency

M-H algorithm example

Let’s try a simple example: estimate the posterior density of a bivariate Normal distribution

\[\mu = \begin{bmatrix} .5\\ .5 \end{bmatrix}\]

\[Covariance = \begin{bmatrix} 1 & 0.6\\ 0.6 & 1 \end{bmatrix}\]

Random samples from a bivariate Normal distribution


M-H algorithm example

And, the marginal distributions of the variables

Marginal distributions of bivariate Normal samples


M-H algorithm example

Now, we are ready to sample these data using the M-H MCMC algorithm

The algorithm is

  1. Compute the bi-variate Normal distribution likelihood
  2. Initialize the chain
  3. Initialize the hyperparameters and summary statistics
  4. Sample the likelihood of the data using the M-H algorithm.
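The steps above can be sketched as follows (a minimal illustration, not the course code; it assumes a symmetric Gaussian proposal, so the Hastings correction cancels and the acceptance ratio reduces to the density ratio):

```python
import numpy as np

rng = np.random.default_rng(42)

# Bivariate Normal target from the example
mu = np.array([0.5, 0.5])
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
cov_inv = np.linalg.inv(cov)

def log_density(x):
    """Unnormalized log density of the bivariate Normal target."""
    d = x - mu
    return -0.5 * d @ cov_inv @ d

def metropolis_hastings(n_samples, step=0.5):
    samples = np.empty((n_samples, 2))
    x = np.zeros(2)                     # initialize the chain
    logp = log_density(x)
    accepted = 0
    for i in range(n_samples):
        # Propose a nearby point from a symmetric Gaussian
        proposal = x + step * rng.standard_normal(2)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(new)/p(old))
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = proposal, logp_new
            accepted += 1
        samples[i] = x
    return samples, accepted / n_samples

samples, acc_rate = metropolis_hastings(20000)
print(samples[500:].mean(axis=0))   # approximately [0.5, 0.5] after burn-in
```

The step size is an arbitrary choice here; it trades off acceptance rate against how quickly the chain explores the parameter space.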
MCMC samples from bivariate Normal distribution


M-H algorithm example

Plot the first 500 samples

MCMC samples from bivariate Normal distribution


Notice the long ‘tail’ on the sampled distribution from the burn-in period.

M-H algorithm example

Marginal distributions of the MCMC samples, excluding the first 500 (burn-in) samples

First 500 MCMC samples from bivariate Normal distribution


These marginals are similar to the original distribution

Convergence and sampling efficiency of MCMC

How can we understand the convergence properties of the M-H MCMC sampler?

For our example these statistics are fairly good:

\[Acceptance\ rate = 0.81\] \[Rejection\ rate = 0.19\]

Evaluation of MCMC sampling

A trace plot displays the sampled value of a parameter against the sample number

Trace plots for two variables from the M-H algorithm


The traces show sampling around the highest density values of the parameters, indicating good convergence

Evaluation of MCMC sampling

An autocorrelation plot shows how a sample value is related to the previous samples

Autocorrelation of Markov chain for the two parameters


Evaluation of MCMC sampling

We can relate sampling efficiency to the autocorrelation of the samples

\[ESS = \frac{N}{1 + 2 \sum_k ACF(k)}\]
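A minimal sketch of this ESS estimate, truncating the autocorrelation sum at the first non-positive lag (a simple heuristic; practical implementations such as ArviZ use more careful truncation rules):

```python
import numpy as np

def effective_sample_size(chain, max_lag=100):
    """ESS = N / (1 + 2 * sum of autocorrelations),
    with the sum truncated at the first non-positive lag."""
    n = len(chain)
    x = chain - chain.mean()
    var = x @ x / n
    acf_sum = 0.0
    for k in range(1, max_lag):
        acf = (x[:-k] @ x[k:]) / (n * var)
        if acf <= 0:        # simple truncation rule
            break
        acf_sum += acf
    return n / (1 + 2 * acf_sum)

rng = np.random.default_rng(0)

# Independent samples: ESS is close to N
iid = rng.standard_normal(10000)
print(effective_sample_size(iid))

# AR(1) chain with strong autocorrelation: ESS is far smaller than N
ar = np.empty(10000)
ar[0] = 0.0
for t in range(1, 10000):
    ar[t] = 0.9 * ar[t - 1] + rng.standard_normal()
print(effective_sample_size(ar))
```

Highly autocorrelated chains carry much less information per sample, which is why low sampling efficiency shows up directly as a small ESS.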

Other MCMC Sampling Algorithms

A number of other powerful MCMC sampling algorithms have been developed

Gibbs sampling

The Gibbs sampler is an improved MCMC sampler that speeds convergence

Gibbs sampling

The basic Gibbs sampler algorithm has the following steps:

  1. For an N dimensional parameter space, \(\{ \theta_1, \theta_2, \ldots, \theta_N \}\), find a random starting point

  2. Cycle through the dimensions in order, \(\{1, 2, 3, \ldots, N\}\), to choose the next dimension to sample, starting with dimension \(1\); the actual order is not important

  3. Sample from the full conditional distribution of that parameter given the observations, \(D\), and the current values of the other parameters, e.g. \(p(\theta_1|D, \theta_2, \theta_3, \ldots, \theta_N)\)

  4. Repeat steps 2 and 3 until convergence
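For the bivariate Normal example the full conditionals are known in closed form, so the steps above can be sketched directly (an illustration assuming unit variances and correlation 0.6):

```python
import numpy as np

rng = np.random.default_rng(7)

# Bivariate Normal target: unit variances, correlation rho.
# The full conditionals are available in closed form:
#   theta_1 | theta_2 ~ N(mu_1 + rho * (theta_2 - mu_2), 1 - rho^2)
mu = np.array([0.5, 0.5])
rho = 0.6
cond_sd = np.sqrt(1 - rho**2)

def gibbs(n_samples):
    samples = np.empty((n_samples, 2))
    t1, t2 = 0.0, 0.0                  # starting point
    for i in range(n_samples):
        # Sample each parameter from its full conditional in turn
        t1 = rng.normal(mu[0] + rho * (t2 - mu[1]), cond_sd)
        t2 = rng.normal(mu[1] + rho * (t1 - mu[0]), cond_sd)
        samples[i] = (t1, t2)
    return samples

samples = gibbs(20000)
print(samples.mean(axis=0))            # approximately [0.5, 0.5]
print(np.corrcoef(samples.T)[0, 1])    # approximately 0.6
```

Every Gibbs draw is accepted, which is one reason the sampler can converge faster than M-H when the conditionals are tractable.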

Hamiltonian MCMC

The Hamiltonian sampler was proposed in 1987 (Duane et al.) and uses a simple idea from classical mechanics

Hamiltonian MCMC

The Hamiltonian sampler uses a simple idea from classical mechanics

\[\mathtt{H}(q,p) = \mathtt{K}(p,q) + \mathtt{V}(q)\]

Hamiltonian MCMC

The Hamiltonian sampler uses a simple idea from classical mechanics

\[p(q,p) \propto e^{-\frac{\mathtt{H}(q,p)}{kT}}\]

Hamiltonian MCMC

The Hamiltonian sampler uses a simple idea from classical mechanics

\[\mathtt{H}(q,p) = \mathtt{K}(p,q) + \mathtt{V}(q)\]

\[\begin{align} \frac{d\ q}{d\ t} &= \frac{\partial \mathtt{H}}{\partial p} = \frac{\partial \mathtt{K}}{\partial p}\\ \frac{d\ p}{d\ t} &= -\frac{\partial \mathtt{H}}{\partial q} = -\frac{\partial \mathtt{K}}{\partial q} - \frac{\partial \mathtt{V}}{\partial q} \end{align}\]

Hamiltonian MCMC

Hamiltonian MCMC is an extension of the M-H algorithm with steps:

  1. Sample \(p \sim \mathtt{N}(0,\sigma)\)
  2. Simulate \(q_t\) and \(p_t\) for L time steps, using the coupled differential equations
  3. Get \(q_L\) as the new proposed state
  4. Apply the M-H acceptance criterion to \(q_L\)
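These steps can be sketched with a leapfrog integrator for the bivariate Normal example (a minimal illustration; the step size and number of leapfrog steps are arbitrary choices here, and the potential energy is the negative log target density):

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.5, 0.5])
cov_inv = np.linalg.inv(np.array([[1.0, 0.6], [0.6, 1.0]]))

def V(q):
    """Potential energy: negative log target density (up to a constant)."""
    d = q - mu
    return 0.5 * d @ cov_inv @ d

def grad_V(q):
    return cov_inv @ (q - mu)

def hmc_step(q, eps=0.1, L=20):
    p = rng.standard_normal(2)                 # 1. sample momentum
    q_new, p_new = q.copy(), p.copy()
    # 2. simulate Hamiltonian dynamics with L leapfrog steps
    p_new -= 0.5 * eps * grad_V(q_new)
    for _ in range(L - 1):
        q_new += eps * p_new
        p_new -= eps * grad_V(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * grad_V(q_new)
    # 3.-4. M-H acceptance on the total energy H = K + V
    h_old = 0.5 * p @ p + V(q)
    h_new = 0.5 * p_new @ p_new + V(q_new)
    if np.log(rng.uniform()) < h_old - h_new:
        return q_new
    return q

q = np.zeros(2)
samples = np.empty((5000, 2))
for i in range(5000):
    q = hmc_step(q)
    samples[i] = q
print(samples.mean(axis=0))   # approximately [0.5, 0.5]
```

Because the leapfrog integrator nearly conserves the Hamiltonian, most proposals are accepted even though each proposal moves far from the current point.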

No U-Turn Sampler

NUTS represents the state of the art in MCMC samplers (Hoffman and Gelman, 2014)

Hamiltonian MCMC

Hamiltonian MCMC is a complex algorithm - some resources for additional details

Key Steps for MCMC Bayes Modeling

Steps of modern Bayes modeling and evaluation

  1. Define the problem
  2. Collect the dataset
  3. Define a model, including priors
  4. MCMC sample the model
  5. Prior predictive checks - are the priors reasonable for the problem?
  6. Verify MCMC sampling convergence and sufficiency
  7. Posterior predictive checks - Do the posterior predictions agree with the observed response?

Packages like ArviZ are dedicated to evaluation of Bayes MCMC models.

Multiple examples of diagnostics for the breakdown of MCMC sampling can be found in an online appendix to a seminal paper, Vehtari et al., 2019

Constructing an MCMC Model with PyMC

Define and construct a simple linear regression model

Data for regression example


Constructing an MCMC Model with PyMC

Define and construct a simple linear regression model

\[\begin{align} \beta_0 &\sim \mathtt{N}(0, 2)\\ \beta_1 &\sim \mathtt{N}(0, 2)\\ \beta_2 &\sim \mathtt{N}(0, 2)\\ \sigma &\sim |\mathtt{N}(0, 1)| \end{align}\]

Constructing an MCMC Model with PyMC

Define and construct a simple linear regression model

\[Y \sim \mathtt{N}(\mu, \sigma), \quad \mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

Constructing an MCMC Model with PyMC

Define and construct a simple linear regression model

import pymc3

# x1, x2 and Y are the observed predictor and response arrays
with pymc3.Model() as regression_model:
    # Priors for unknown model parameters
    betas = pymc3.Normal("betas", mu=0, sigma=2, shape=3)
    sigma = pymc3.HalfNormal("sigma", sigma=1)

    # Deterministic expected value of the outcome
    mu = betas[0] + betas[1] * x1 + betas[2] * x2

    # Likelihood (sampling distribution) of the observations
    Y_obs = pymc3.Normal("Y_obs", mu=mu, sigma=sigma, observed=Y)

Prior Predictive Checks

Prior predictive checks to verify prior distribution choice

Prior predictive response vs response variable distribution


Prior Predictive Checks

Prior predictive checks to verify prior distribution choice

Prior density of Betas and sigma


Sampling and Inference with Model

Evaluate the sampled posterior and perform inference

Traces and posterior density of MCMC traces


Sampling and Inference with Model

Evaluate the sampled posterior and perform inference

Example from Vehtari et al., 2021, a seminal paper on the evaluation of MCMC

Sampling and Inference with Model

We must verify that the samples are representative

Table of MCMC convergence statistics


There is a lot of information in this summary table:

  1. The Monte Carlo standard error (MCSE) of the mean
    • Estimates how much error the MCMC sampling has introduced
  2. The standard deviation (sd) of the posterior distribution
  3. Metrics of effective sample size (ESS)
    • ESS of the bulk (middle) portion of the posterior distribution
    • ESS of the tails of the posterior distribution
  4. The Gelman-Rubin statistic (\(\hat{R}\)) compares the variance between chains to the variance within chains
    • The Gelman-Rubin statistic should converge to 1.0
    • If all chains converge, the variance between chains and the variance within the chains should be the same
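A minimal sketch of the Gelman-Rubin computation (the basic form of \(\hat{R}\); modern implementations such as ArviZ use split chains and rank normalization):

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat from m chains of length n (rows = chains)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)

# Four well-mixed chains targeting the same distribution: R-hat near 1
good = rng.standard_normal((4, 2000))
print(gelman_rubin(good))      # close to 1.0

# Chains stuck at different locations: R-hat well above 1
bad = good + np.array([[0.0], [1.0], [2.0], [3.0]])
print(gelman_rubin(bad))       # substantially greater than 1
```

When the chains disagree, the between-chain variance inflates the pooled estimate relative to the within-chain variance, pushing \(\hat{R}\) above 1.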

Sampling and Inference with Model

We must verify that the samples are representative

MCSE by quantile for each model parameter


Sampling and Inference with Model

We must verify that the samples are representative

ESS locally and by quantile for each model parameter


Sampling and Inference with Model

Evaluate the sampled posterior and perform inference

HDIs of model parameters by trace


Sampling and Inference with Model

Evaluate the sampled posterior and perform inference

Posterior distribution of parameters with HDIs


Posterior Predictive Checks

We must check that predictions from the model are reasonable

Posterior Predictive Checks

We must check that predictions from the model are reasonable

\[p(y^{rep}|y)= \int p(y^{rep}|\theta)\, p(\theta|y)\, d\theta\]

Where:
\(y^{rep} =\) replicated data drawn from the posterior predictive distribution
\(y =\) the observed data
\(p(y^{rep}|\theta) =\) distribution of replicated data given the parameters \(\theta\)
\(p(\theta|y) =\) posterior distribution of the model parameters, \(\theta\)

Posterior Predictive Checks

We must check that predictions from the model are reasonable

Posterior predictive distribution vs. observed responses


Posterior Predictive Checks

We must check that predictions from the model are reasonable

\[p = p(y^{rep} < y | y )\]
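A sketch of this computation from samples (the posterior predictive draws here are simulated directly from a stand-in distribution, in place of draws from a fitted model):

```python
import numpy as np

rng = np.random.default_rng(5)

# Observed data and replicated draws from a stand-in posterior predictive.
# Here the "posterior predictive" is simulated as N(0.5, 1) for illustration.
y_obs = rng.normal(0.5, 1.0, size=200)
y_rep = rng.normal(0.5, 1.0, size=(4000, 200))   # 4000 predictive draws

# Bayesian p-value for a test statistic, e.g. the mean:
# p = P(T(y_rep) < T(y) | y)
p_value = (y_rep.mean(axis=1) < y_obs.mean()).mean()
print(p_value)
```

Values very close to 0 or 1 indicate that the observed statistic falls in the tails of the posterior predictive distribution, a sign of model misfit.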

Posterior Predictive Checks

We must check that predictions from the model are reasonable

Distribution of Bayesian p-values


Posterior Predictive Checks

We must check that predictions from the model are reasonable

\[p_i = p(y_i^{rep} < y_i | y )\]

Distribution of Bayesian u-values


British Coal Mine Disasters

Analysis of change point in British coal mine disasters

British Coal Mine Disasters

Analysis of change point in British coal mine disasters

Number of disasters per year 1851-1962


British Coal Mine Disasters

Analysis of change point in British coal mine disasters

\[ D_t \sim Pois(r_t),\quad r_t = \begin{cases} e, & t < s\\ l, & t \ge s \end{cases} \]

\[\begin{align} s &\sim unif(t_{min}, t_{max})\\ e &\sim exp(2) \\ l &\sim exp(2) \end{align}\]

British Coal Mine Disasters

Analysis of change point in British coal mine disasters

import pymc3

# disaster_data is a DataFrame with Year and Count columns
with pymc3.Model() as disaster_model:
    # Uniform prior on the switch point
    switchpoint = pymc3.DiscreteUniform("switchpoint", lower=disaster_data.Year.min(), upper=disaster_data.Year.max())
    # Alternative, more informative prior based on domain knowledge:
    # switchpoint = pymc3.Triangular("switchpoint", lower=disaster_data.Year.min(), upper=disaster_data.Year.max(), c=1900)

    # Priors for the pre- and post-switch rates
    early_rate = pymc3.Exponential("early_rate", 2.0)
    late_rate = pymc3.Exponential("late_rate", 2.0)

    # Poisson rate switches between the rates for years before and after the switch point
    rate = pymc3.math.switch(switchpoint >= disaster_data.Year, early_rate, late_rate)

    disasters = pymc3.Poisson("disasters", rate, observed=disaster_data.Count)

British Coal Mine Disasters

Analysis of change point in British coal mine disasters

HDI of parameters by trace



British Coal Mine Disasters

Analysis of change point in British coal mine disasters

Posterior distribution of parameters with HDI


British Coal Mine Disasters

Analysis of change point in British coal mine disasters

Posterior predictive checks


Summary

For complex Bayesian models we need a computationally efficient approximation

Summary

Performance metrics of MCMC sampling